Many methods have been developed for data clustering, such as k-means, expectation maximization
and algorithms based on graph theory. In the latter case, graphs are generally constructed by
taking the Euclidean distance as a similarity measure, and are partitioned using spectral
methods. However, these methods are not accurate when the clusters are not well separated. In
addition, they cannot automatically determine the number of clusters. These limitations can
be overcome by using network community identification algorithms. In this work, we
propose a methodology for data clustering based on complex networks theory. We compare different
metrics for quantifying the similarity between objects and consider three community-finding
techniques. This approach is applied to two real-world databases and to two sets of artificially
generated data. By comparing our method with traditional clustering approaches, we verify that the
proximity measures given by the Chebyshev and Manhattan distances are the most suitable metrics
to quantify the similarity between objects. In addition, the community identification method based
on greedy optimization provides the smallest misclassification rates.

PACS numbers: 89.75.Hc,89.75.-k,89.75.Kd 



INTRODUCTION 

Classification is one of the most intrinsic activities of 
human beings, being used to facilitate the handling and 
organization of the huge amount of information that we 
receive every day. As a matter of fact, the brain is able to 
recognize objects in scenes and also to provide a catego- 
rization of objects, persons, or events. This classification 
is performed in order to cluster objects that are similar 
with respect to common attributes. Actually, humans 
have by now classified almost all known living species 
and materials on earth. Due to the importance of the 
classification task, it is fundamental to develop methods 
able to perform this task automatically. Indeed, many 
methods for categorization have been developed with ap- 
plication to life sciences (biology, zoology), medical sci- 
ences (psychiatry, pathology), social sciences (sociology, 
archaeology), earth sciences (geography, geology), and
engineering.

The process of classification can be performed in two
different ways: supervised classification, where objects of
previously known classes are provided as prototypes for
classifying additional objects; and unsupervised
classification, where no previous knowledge about
the classes is provided. In the latter case, the
categorization is performed so as to maximize the similarity
between the objects in each class while minimizing the
similarity between objects in different classes. In the
current work, we introduce a method for unsupervised
classification based on complex networks.

Unsupervised classification may be found under different
names in different contexts, such as clustering (in
pattern recognition), numerical taxonomy (in ecology)
and partition (in graph theory). In the current work, we
adopt the term "clustering". Clustering can be used in
many tasks, such as data reduction, performed by grouping
data into clusters and processing each cluster as a
single entity; hypothesis generation, when there is no
information about the analyzed data; hypothesis testing,
i.e. verification of the validity of a particular
hypothesis; and prediction based on classes, where the
obtained clusters are based on the characteristics of the
respective patterns. As a matter of fact, clustering is a
fundamental tool for many research fields, such as machine
learning, data mining, pattern recognition, image analysis,
information retrieval, and bioinformatics.

Many methods have been developed for data clustering,
many of which are based on graph theory. Graph-based
clustering methods take into account algorithms related
to minimum spanning trees, regions of influence, directed
trees and spectral analysis. These methods are able to
detect clusters of various shapes, at least when they are
well separated. However, these algorithms present some
drawbacks. Spectral clustering, for instance, only
divides the graph into two groups and not into an
arbitrary number of clusters. Division into more than two
groups can be achieved by repeated bisection, but there
is no guarantee of reaching the best division into three
groups. Also, these methods give no hint about how
many clusters should be identified. On the other hand,
methods for community identification in networks are
able to handle these drawbacks [9]. Moreover, these
methods provide more accurate partitions than the
traditional graph-based methods, such as spectral
partitioning. In this sense, methods based on complex
networks are improvements of clustering approaches based
on graphs.

Only recently has a method been developed for data
clustering based on complex networks concepts. In
this case, the authors proposed a clustering method based
on graph partitioning and the Chameleon algorithm [11].
Although this method is able to detect clusters of
different shapes, it presents some drawbacks. The authors
considered a very particular method for community
identification, which does not provide the most accurate
network division. In addition, they considered only a single
metric to establish the connections between every pair
of objects, i.e. the Euclidean distance. The method
introduced in the current work overcomes these
limitations. We adopt the most accurate community
identification methods and use the most traditional
metrics to define the similarity between objects,
including the Euclidean, Manhattan, Chebyshev, Fu and
Tanimoto distances. The accuracy of our methodology
is evaluated on two artificial as well as two real-world
databases. Moreover, we compare our methodology with
some traditional clustering algorithms, i.e. k-means,
cobweb, expectation maximization and farthest first. We
verify that our approach provides the smallest error rates.
We therefore conclude that complex networks theory seems
to provide tools and concepts able to improve the
clustering methods based on graphs, potentially
outperforming the most traditional clustering methods.



CONCEPTS AND METHODS

Complex networks

Complex networks are graphs with non-trivial topological
features, whose connections are distributed according to a
power law [13]. An undirected network can be represented
by its adjacency matrix A, whose elements $a_{ij}$
are equal to one whenever there is a connection between
the vertices i and j, and equal to zero otherwise. A more
general representation takes into account weighted
connections, where each edge (i, j) has an associated
weight or strength $w(i, j)$.

Different measures have been developed to characterize
the topology of network structures, such as the clustering
coefficient, distance-related measurements and centrality
metrics [14]. By allowing the different network properties
to be quantified, these measures have revealed that most
real-world networks are far from purely random [15].

In addition to this highly intricate topological
organization, complex networks also tend to present modular
structure. In this case, the modules are clusters whose
vertices play similar roles, such as in the case of the
brain of mammals, where cortical modules are associated
with brain functions. Communities follow the same
principle as clusters in pattern recognition research. In
this way, the algorithms developed for community
identification can also be used to partition graphs and
find clusters.

Different methods have been developed in order to find
communities in networks. Basically, these methods can
be grouped as spectral methods (e.g. [13]), divisive
methods, agglomerative methods, and local methods. The
choice of the best method depends on the specific
application, including the network size and number of
connections. This is due to the fact that the most precise
methods, such as the extremal optimization algorithm, are
quite time-consuming. Here, we take three different methods
that provide accurate results, but have different time
complexities. These methods are described in the next
section.

The quality of a particular network division can be
evaluated in terms of the modularity measure. This metric
allows the number of communities to be automatically
determined according to the best network partition.
For a network partitioned into m communities, a matrix
E, m × m, is constructed whose elements $e_{ij}$ represent
the fraction of connections between communities i and j.
The modularity Q is calculated as

    Q = \sum_{i} \left[ e_{ii} - \left( \sum_{j} e_{ij} \right)^{2} \right] = \mathrm{Tr}\, E - \left\| E^{2} \right\| .    (1)

The highest value of modularity is obtained for the best
network division. In particular, networks that present
high values of Q have modular structure, implying that
clusters are identified with high accuracy [15].

Clustering based on network

In the literature, there are many definitions of clusters,
such as that provided by Everitt et al., where clusters
are understood as continuous regions of the feature space
containing a high density of points, separated from other
high-density regions by low-density regions. This
definition is similar to that of network communities, i.e. a
community is topologically defined as a subset of highly
inter-connected vertices which are relatively sparsely
connected to nodes in other communities.



Let each object (also called a pattern) be represented
by a feature vector $x = [x_1, x_2, \ldots, x_n]$. These
features are scalar numbers that quantify the properties
of objects. For instance, in the case of the Iris database,
the objects are flowers and the attributes are the length
and the width of the sepal and the petal, in centimeters
[21]. The clustering approach consists of grouping
the feature vectors into m clusters, $C_1, C_2, \ldots, C_m$, in
such a way that objects belonging to the same cluster
exhibit higher similarity with each other than with
objects in other groups.

The process of clustering based on networks involves
the definition of the following concepts:

1. Proximity measure: each object is represented as a
node, and each pair of nodes is connected according
to their similarity. These connections are weighted
so as to quantify how similar each pair of vertices
is in terms of their feature vectors. In this way,
the most similar objects are connected by the
strongest edges.

2. Clustering criterion: modularity is the most
traditional measure used to quantify the quality of a
network division (see Equation (1)). Here, we
adopt this metric to automatically choose the best
cluster partition. In problems in which the number
of clusters is known, it is not necessary to consider
the modularity.

3. Clustering algorithms: complex networks theory
provides many algorithms for community
identification, which act as the clustering algorithms.
The choice of the most suitable method for a
particular application should take into account the
error rate and the execution time.

4. Validation of the results: the validation of the
clustering methods based on networks can be
performed in two different ways: (i) by considering
databases in which the clusters are known (or at
least expected), such as the Iris database [21], and
(ii) by taking into account artificial data with
cluster organization, which allows the level of modular
organization of the data to be controlled.
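The clustering criterion of item 2 can be illustrated with a small sketch. The function below evaluates the modularity of Equation (1) for a given partition of a weighted, undirected network. It is a minimal illustration in plain Python rather than the implementation used in this work: the names `modularity`, `weights` and `labels` are chosen here for the example, and the fraction of connections between two communities is taken over all ordered vertex pairs, so that the entries of E sum to one.

```python
def modularity(weights, labels):
    """Modularity Q = sum_i [e_ii - (sum_j e_ij)^2] of a partition.

    weights: symmetric matrix (list of lists) of edge weights, zero diagonal.
    labels:  labels[v] is the non-negative community index of vertex v.
    """
    n = len(weights)
    m = max(labels) + 1
    # Total weight counted over ordered pairs (twice the sum over edges).
    total = sum(weights[u][v] for u in range(n) for v in range(n))
    if total == 0:
        return 0.0
    # e[i][j]: fraction of the total edge weight joining communities i and j.
    e = [[0.0] * m for _ in range(m)]
    for u in range(n):
        for v in range(n):
            e[labels[u]][labels[v]] += weights[u][v] / total
    return sum(e[i][i] - sum(e[i]) ** 2 for i in range(m))


# Hypothetical example: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_triangles = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    two_triangles[u][v] = two_triangles[v][u] = 1.0

q_split = modularity(two_triangles, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(two_triangles, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the two triangles yields a higher value of Q than keeping all vertices in a single community, which is the behaviour the criterion relies on when the number of clusters is not given.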

Proximity measures can be classified into two types:
similarity measures, for which $s(x, y) = s_0$ only if $x = y$
and $-\infty < s(x, y) \leq s_0 < +\infty$; and dissimilarity
measures, for which $d(x, y) = d_0$ only if $x = y$ and
$-\infty < d_0 \leq d(x, y) < +\infty$. To construct networks, it is
more natural to adopt similarity measures, since the
edges with the strongest weights are expected to occur
between the vertices with the most similar feature vectors.
We therefore adopt the following similarity measures
to develop the network-based clustering approach:

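1. Inverse of Euclidean distance:

    D_E^{-1}(x, y) = \frac{1}{d_2(x, y)} ,    (2)

where $d_2(x, y)$ is the traditional Euclidean distance.
This metric results in values in the interval $[0, \infty)$.

2. Exponential of Euclidean distance:

    S_E(x, y) = \alpha \exp\left( -\alpha \, d_2(x, y) \right) ,    (3)

where this metric results in values in the interval
$[0, \alpha]$.

3. Inverse of Manhattan distance:

    D_M^{-1}(x, y) = \frac{1}{D_M(x, y)} , \qquad D_M(x, y) = \sum_{i=1}^{n} | x_i - y_i | ,    (4)

which assumes values in $[0, \infty)$.

4. Exponential of Manhattan distance:

    S_M(x, y) = \alpha \exp\left( -\alpha \, D_M(x, y) \right) ,    (5)

assuming values in the interval $[0, \alpha]$.

5. Inverse of Chebyshev distance:

    D_C^{-1}(x, y) = \frac{1}{D_C(x, y)} , \qquad D_C(x, y) = \max_{1 \leq i \leq n} | x_i - y_i | .    (6)

This metric results in values in $[0, \infty)$.

6. Exponential of Chebyshev distance:

    S_C(x, y) = \alpha \exp\left( -\alpha \, D_C(x, y) \right) ,    (7)

assuming values in the interval $[0, \alpha]$.

7. Metric proposed by Fu:

    F(x, y) = 1 - \frac{d_2(x, y)}{\| x \| + \| y \|} .    (8)

This metric results in values in the interval $[0, 1]$.

8. Exponential of the metric proposed by Fu:

    S_F(x, y) = \alpha \exp\left( -\alpha \left( 1 - F(x, y) \right) \right) .    (9)

If $F(x, y) = 1$, then $S_F(x, y) = \alpha$. If $F(x, y) = 0$,
then $S_F(x, y) = \alpha \exp(-\alpha)$. Therefore, $S_F$
assumes values in this limited interval.

9. Exponential of the Tanimoto measure:

    S_T(x, y) = \alpha \exp\left( -\alpha \left( 1 - T(x, y) \right) \right) ,    (10)

where

    T(x, y) = \frac{x^{T} y}{\| x \|^{2} + \| y \|^{2} - x^{T} y} .    (11)

This metric assumes values in $(-\infty, 1]$. Therefore,
if $T(x, y) = 1$, then $S_T(x, y) = \alpha$; if $T(x, y) \to -\infty$,
then $S_T(x, y) \to 0$.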
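The inverse and exponential similarity measures built on the Euclidean, Manhattan and Chebyshev distances can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the function names are chosen here, the default `alpha=1.0` is an arbitrary value (the text does not fix the parameter), and returning infinity for coincident points in the inverse measures is a choice made for the example. The measures based on Fu and Tanimoto are left out.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def inverse_similarity(distance, x, y):
    """Inverse of a distance; coincident points get infinite similarity (assumption)."""
    d = distance(x, y)
    return math.inf if d == 0 else 1.0 / d


def exponential_similarity(distance, x, y, alpha=1.0):
    """alpha * exp(-alpha * distance); alpha is a free parameter (assumed value)."""
    return alpha * math.exp(-alpha * distance(x, y))


def similarity_matrix(points, similarity):
    """Symmetric weight matrix of the fully connected similarity network."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = similarity(points[i], points[j])
    return w


# Hypothetical usage: weights given by the inverse of the Chebyshev distance.
points = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)]
weights = similarity_matrix(points, lambda x, y: inverse_similarity(chebyshev, x, y))
```

The resulting matrix plays the role of the weighted adjacency matrix of the network whose communities are then identified.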

In order to divide networks into communities and
therefore obtain the clusters, we adopt three methods,
namely the maximization of the modularity based on a
greedy algorithm, here called the fastgreedy algorithm;
the extremal optimization approach; and the walktrap
method. In the two former methods, two communities i and
j are joined according to the increase of the modularity Q
of the network. Thus, starting with each vertex
disconnected and considering each of them as a community,
we repeatedly join communities together in pairs,
choosing at each step the merging that results in the
greatest increase (or smallest decrease) of the
modularity Q. The best division corresponds to the
partition that results in the highest value of Q. The
difference between these two methods lies in the choice
of the optimization algorithm. The walktrap method, on
the other hand, is based on random walks, where the
community identification uses a metric that considers
the transition probability matrix. The execution time of
the walktrap method scales as $O(N^2 \log N)$. While the
fastgreedy method is believed to be the fastest one,
running in $O(N \log^2 N)$, the extremal optimization
provides the most accurate division. However, the
extremal optimization method is not particularly fast,
scaling as $O(N^2 \log N)$.
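The agglomerative procedure described above (start from single vertices, repeatedly merge the pair of communities giving the greatest increase or smallest decrease of Q, keep the partition with the highest Q) can be sketched as below. This is a deliberately naive illustration: it recomputes Q from scratch for every candidate merge and is therefore far slower than the fastgreedy algorithm, whose efficient data structures are not reproduced here. The names `modularity` and `greedy_agglomeration` are chosen for the example.

```python
def modularity(weights, labels):
    """Q = sum_i [e_ii - (sum_j e_ij)^2] for a symmetric weight matrix."""
    n = len(weights)
    m = max(labels) + 1
    total = sum(weights[u][v] for u in range(n) for v in range(n))
    if total == 0:
        return 0.0
    e = [[0.0] * m for _ in range(m)]
    for u in range(n):
        for v in range(n):
            e[labels[u]][labels[v]] += weights[u][v] / total
    return sum(e[i][i] - sum(e[i]) ** 2 for i in range(m))


def greedy_agglomeration(weights):
    """Return (labels, Q) of the best partition found by greedy merging."""
    n = len(weights)
    labels = list(range(n))  # every vertex starts as its own community
    best_q, best_labels = modularity(weights, labels), labels[:]
    while len(set(labels)) > 1:
        groups = sorted(set(labels))
        step_q, step_labels = -float("inf"), None
        # Try every pair of communities and keep the best merge of this step.
        for i, a in enumerate(groups):
            for b in groups[i + 1:]:
                trial = [a if c == b else c for c in labels]
                q = modularity(weights, trial)
                if q > step_q:
                    step_q, step_labels = q, trial
        labels = step_labels
        # Remember the level of the merge hierarchy with the highest Q.
        if step_q > best_q:
            best_q, best_labels = step_q, labels[:]
    return best_labels, best_q


# Hypothetical example: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    graph[u][v] = graph[v][u] = 1.0

found_labels, found_q = greedy_agglomeration(graph)
```

The sequence of merges corresponds to a dendrogram, and the number of clusters is read off at the level where Q is maximal; when the number of clusters is known, the merging can instead be stopped at that number.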

The validation of the network-based clustering method
is performed with respect to artificial (i.e. computer
generated clusters) and real-world databases. In the case
of artificial data, we use two different configurations,
i.e. (i) two separated clouds of points with a Gaussian
distribution in a two-dimensional space, and (ii) two
semi-circles with varying density of points, as presented
in Figure 1. In the former case, the validation set
consists of two sets of points (clusters) generated
according to a Gaussian distribution with covariance
matrix equal to the identity ($\Sigma = I$). The mean of
one set of points is moved from the origin (0, 0) to
(0, 15), in steps of 0.75, while the other cluster remains
fixed at the origin. In this way, the distance between
clusters is varied from d = 0 to d = 15. Figures 1(a) to
(c) show three cases considering three distances, i.e.
d = 0, 3 and 15. Observe that as d increases, the cluster
identification becomes easier. The second artificial
database corresponds to a classic problem in pattern
recognition. It consists of two sets of points uniformly
generated in two limited semi-circular areas. In this
case, the density of points, i.e. the number of points
per unit of area, defines the cluster resolution, with
higher density producing more defined clusters. In our
analysis, this density is varied from 1 to 32, in steps
of 1.6. Figures 1(d) to (f) show three configurations of
this artificial database generated by taking into account
three different densities, p = 1, 6.4 and 14.4.
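Both artificial configurations can be generated along the following lines. This is a sketch under several assumptions that the text does not specify: the number of points per Gaussian cloud, the inner and outer radii of the semi-circular areas, the placement of the second semi-circle (mirrored and shifted, as in the usual two-moons layout) and the random seed. All function and parameter names are chosen for the example.

```python
import math
import random


def gaussian_clouds(n_per_cluster, d, seed=0):
    """Two 2-D Gaussian clouds with identity covariance, centred at (0, 0) and (0, d)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, centre_y in enumerate((0.0, d)):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(0.0, 1.0), centre_y + rng.gauss(0.0, 1.0)))
            labels.append(label)
    return points, labels


def semicircles(density, r_in=2.0, r_out=3.0, seed=0):
    """Two half-ring areas filled uniformly; density = points per unit of area.

    The radii and the offset of the second half-ring are assumptions.
    """
    rng = random.Random(seed)
    area = math.pi * (r_out ** 2 - r_in ** 2) / 2.0
    n = max(1, round(density * area))
    shift_x, shift_y = (r_in + r_out) / 2.0, 0.5
    points, labels = [], []
    for label in (0, 1):
        for _ in range(n):
            # Sampling r^2 uniformly gives a uniform density over the ring area.
            r = math.sqrt(rng.uniform(r_in ** 2, r_out ** 2))
            theta = rng.uniform(0.0, math.pi)
            x, y = r * math.cos(theta), r * math.sin(theta)
            if label == 1:
                x, y = shift_x + x, shift_y - y  # mirrored and shifted copy
            points.append((x, y))
            labels.append(label)
    return points, labels


# Separations d = 0, 0.75, ..., 15, as described in the text.
separations = [0.75 * k for k in range(21)]
cloud_points, cloud_labels = gaussian_clouds(n_per_cluster=50, d=3.0)
moon_points, moon_labels = semicircles(density=2.0)
```

Keeping the true labels alongside the points is what allows the clustering error to be measured afterwards.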

With respect to real-world databases, we take into
account two datasets, i.e. the Iris database [21] and
the Breast Cancer Wisconsin database [25]. The Iris
database is composed of three species of Iris flowers (Iris
setosa, Iris virginica and Iris versicolor). Each class
consists of 50 samples, where four features were measured
from each sample, i.e. the length and the width of the
sepal and the petal, in centimeters. The cancer database,
in turn, is composed of features computed from a digitized
image of a fine needle aspirate of a breast mass, where
30 real-valued features are computed for each cell
nucleus [25]. This database is composed of 699 cells, of
which 241 are malignant and 458 are benign.



RESULTS AND DISCUSSION 

The accuracy of the clustering method based on networks
is compared with four traditional clustering methods,
namely k-means, cobweb, farthest first and expectation
maximization (EM) [26]. These methods present
different properties, such as the tendency of k-means to
find spherical clusters [4]. Moreover, we consider three
methods for community identification, namely fastgreedy,
extremal optimization and walktrap. However, since the
fastgreedy and extremal optimization methods result in
the same error rates for all considered databases,
we discuss only the results of the fastgreedy method,
which is faster than the extremal optimization approach.
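The text does not spell out how the error rates are computed. One common convention, assumed in the sketch below, is to match clusters to classes one-to-one so that the agreement is maximal and to report the fraction of objects left unmatched. The function name `error_rate` is chosen for the example, and the exhaustive search over assignments is only practical for a handful of clusters.

```python
from itertools import permutations


def error_rate(true_labels, cluster_labels):
    """Fraction of misclassified objects under the best one-to-one matching
    between clusters and classes (assumed convention)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with dummy classes so that every cluster can be assigned something.
    size = max(len(classes), len(clusters))
    padded = classes + [None] * (size - len(classes))
    # counts[(cluster, class)]: number of objects in both.
    counts = {}
    for t, c in zip(true_labels, cluster_labels):
        counts[(c, t)] = counts.get((c, t), 0) + 1
    best = 0
    for assignment in permutations(padded, len(clusters)):
        agree = sum(counts.get((c, t), 0) for c, t in zip(clusters, assignment))
        best = max(best, agree)
    return 1.0 - best / len(true_labels)


# Hypothetical example: three classes, but only two clusters were found.
truth = [0, 0, 1, 1, 2, 2]
found = [0, 0, 0, 0, 1, 1]
err = error_rate(truth, found)
```

Under this convention, merging two equally sized classes of a three-class data set into one cluster misclassifies one third of the objects.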

We start our analysis by taking into account the Iris
and the Breast Cancer Wisconsin databases. As a
preliminary data visualization, we project the patterns into
a two-dimensional space by principal component analysis.
Figure 2 shows the projections. There is no clear
separation between the clusters for either database.

Since the attributes in the Iris data present different
ranges, having values such as 0.1 for the petal width and
7.2 for the sepal length, it is necessary to apply a
feature standardization procedure. In this case, each
attribute is transformed so as to present mean equal to
zero and standard deviation equal to one. This
transformation, called standardization, is performed as

    y_f = \frac{x_f - \bar{x}_f}{\sigma_f} ,    (12)

where $\bar{x}_f$ and $\sigma_f$ are the average and standard
deviation of the values of attribute f, respectively. The
results obtained with the four traditional clustering
algorithms are presented in Table I. EM and k-means
exhibit the smallest errors among the traditional methods.
However, note that this performance is obtained when the
number of clusters is known. EM, for instance, yields
an error of 40% when the number of clusters is unknown.
This is a limitation of these methods, since in most
cases the information about the number of classes is
not available.
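The standardization step (subtract the attribute mean, divide by the attribute standard deviation) can be sketched in a few lines. The function name `standardize` is chosen for the example; using the population standard deviation (division by the number of samples) and mapping constant attributes to zero are assumptions, since the text does not state either choice.

```python
import math


def standardize(rows):
    """Rescale every attribute (column) to zero mean and unit standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Population standard deviation (assumption); constant columns give 0.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical data: two attributes with very different ranges.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize(raw)
```

After the transformation both attributes contribute on the same scale to the distances from which the network weights are computed.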

Table I also presents the results for the clustering
approaches based on complex networks. Only the
combinations of metric and community algorithm
that result in the smallest error rates are shown in this
table. The smallest error was obtained by taking into
account the inverse of the Chebyshev distance and the
fastgreedy community identification algorithm. The
error obtained in this case is equal to 4.7%. The second
best performance is obtained by considering the inverse
of the Euclidean distance and the fastgreedy or the
walktrap algorithms, which provide an error of 6%. In
addition to the smallest error rates, network-based
clustering presents another important feature, i.e. it is
not necessary to specify the number of clusters present
in the database. Indeed, the maximum value of the
modularity suggests the most accurate partition.
Nevertheless, for some proximity measures, the modularity
is not able to determine the best partition. In this
case, knowledge of the number of clusters results in a
reduction of the error rates, as in the case of the
exponential of the Tanimoto distance, where the error is
reduced from 33.3% to 6%, and the exponential of the
Chebyshev distance, where the error is reduced from 33.3%
to 7.3%. Therefore, for the Iris data, such metrics are
not appropriate for network-based clustering. We also
analyze the clustering error without standardization. In
this case, the error rates are larger than those obtained
with standardization in some cases. However, for the best
results, we verify that the errors are similar in both
cases. Figure 3 presents the dendrogram obtained for the
best separation, i.e. by taking into account the inverse
of the Chebyshev distance and the fastgreedy community
identification method. Observe that the best partition is
obtained for the highest value of the modularity measure.

FIG. 1: Examples of the artificial databases for clustering error evaluation. The first set of points is generated according to a
Gaussian distribution, where the two sets are separated by distances (a) d = 0, (b) d = 3 and (c) d = 15. In the case of the
second set, a higher density of points provides more defined clusters, as shown in examples (d) p = 1.0, (e) p = 6.4 and (f)
p = 14.4.



TABLE I: Clustering errors for the Iris database considering
the cases in which the number of classes k is known (k = 3)
or unknown (k = ?). EM and k-means are the only methods
that need to specify the number of clusters k.

Method                    % error (k = ?)   % error (k = 3)
k-means                   -                 11.3
cobweb                    33.3              -
farthest first            -                 14.0
EM                        40.0              9.3
D_E^{-1} - fastgreedy     6.0               6.0
D_E^{-1} - walktrap       6.0               6.0
S_E - walktrap            33.3              14.7
S_T - walktrap            33.3              6.0
S_F - walktrap            33.3              7.3
D_M^{-1} - fastgreedy     33.3              6.0
D_M^{-1} - walktrap       33.3              6.0
S_M - walktrap            33.3              6.0
D_C^{-1} - fastgreedy     4.7               4.7
D_C^{-1} - walktrap       9.3               9.3
S_C - walktrap            33.3              7.3


The cancer database also needs to be pre-processed by
standardization. The obtained clustering errors are
presented in Table II. Only the combinations of metric
and community algorithm that result in the smallest
error rates are shown in this table. In this case, the
smallest clustering error is obtained by the k-means
method, which produces an error rate of 7.2%. However,
the complex networks-based method taking into account the
inverse of the Manhattan distance and the walktrap
algorithm for community identification provides an error
rate of 7.9%. Observe that when the number of clusters is
known, all methods result in smaller error rates.
Nevertheless, the $D_M^{-1}$-walktrap combination produces
the same error rate of 7.9% even when k = 2. Therefore,
the highest value of the modularity accounts for the
separation for this method. Although our proposed method
results in a higher error than the k-means methodology,
it presents the advantage that it is not necessary to
know the number of clusters. In this way, our methodology
is also suitable for determining the clusters of the
Breast Cancer Wisconsin database.

FIG. 2: Projection of (a) the Iris and (b) the Breast Cancer
Wisconsin databases by principal component analysis. The
horizontal axis of each panel is the 1st PCA component.



FIG. 3: Dendrogram obtained by taking into account the
inverse of Chebyshev distance and the fastgreedy community
identification algorithm for the Iris data. The cut in the
dendrogram results in three classes, where the error rate is
equal to 4.7%.

TABLE II: Clustering errors for the Breast Cancer Wisconsin
database considering the cases in which the number of classes
k is known (k = 2) or unknown (k = ?). EM and k-means are
the only methods that need to specify the number of clusters
k.

Method                    % error (k = ?)   % error (k = 2)
k-means                   -                 7.2
cobweb                    37.2              -
farthest first            -                 35.3
EM                        75.9              8.8
- walktrap                52.9              9.8
S_F - fastgreedy          17.6              17.6
D_M^{-1} - walktrap       7.9               7.9
- fastgreedy              50.8              15.3
S_C - walktrap            15.9              15.9

In order to provide a more comprehensive evaluation
of the proposed complex networks-based clustering
approach, we generated two sets of artificial data in a
two-dimensional space, as discussed in the last section.
These artificial data allow the cluster separability of
the generated databases to be controlled. Initially, we
consider two clusters of points with Gaussian distribution
in a two-dimensional space separated by a distance d.
Figure 4 presents the best results obtained for the
complex networks-based approach taking into account
different proximity measures. For all cases, the number of
clusters is determined automatically by the maximum value
of the modularity. Note that the error rate goes to zero
for d > 5. Figure 4(f) shows the comparison between the
traditional clustering method which produced the best
results, i.e. cobweb, and the best complex networks
approach. In this case, the method based on the
exponential of the Chebyshev distance and the fastgreedy
algorithm provides the smallest error rate. Observe that
the variation of the error rate is also small for this
method, compared with cobweb.

FIG. 4: Error rates obtained according to the separation of
two clusters composed of sets of points with Gaussian
distribution in a two-dimensional space. We show the best results
for each proximity measure, i.e. (a) Euclidean, (b) Tanimoto,
(c) Fu, (d) Manhattan and (e) Chebyshev distances. The
number of clusters is determined automatically by the
maximum value of the modularity for all cases. In (f), the most
accurate network-based method is compared with the best
traditional clustering approach. Each point is an average over
10 simulations.

FIG. 5: Error rates obtained according to the separation of
two clusters composed of sets of points with Gaussian
distribution in a two-dimensional space for the case where the
number of clusters is known. We show the best results for
each proximity measure, i.e. (a) Euclidean, (b) Tanimoto, (c)
Fu, (d) Manhattan and (e) Chebyshev distances. The
number of clusters is set as k = 2 for all methods. In (f), the
most accurate network-based method is compared with the
best traditional clustering approach. Each point is an
average over 10 simulations.

The k-means algorithm cannot be used in the comparison
where the number of clusters k is unknown. Thus,
we also consider the case where k is specified for all
methods. Figure 5 presents the obtained results. In all
cases, the error rate goes to zero for d > 5. As in the
case of an unknown number of clusters, the method based on
the exponential of the Chebyshev distance and the
fastgreedy algorithm provides the smallest error rate.
Among the traditional algorithms, k-means yields the most
accurate results. In fact, for d > 3, both k-means and the
network-based clustering method provide similar error
rates. For this database, accurate results were expected
for the k-means method, since the clusters are symmetric
around the means, with the points equally distributed
between the two clusters. Observe that the other variants
of the complex networks-based method also yield small
error rates. A comparison with the case where the number
of clusters is unknown (Figure 4) shows that the error
rates are similar for both approaches. Therefore, the
network-based methods result in the most accurate cluster
partitions and have the advantage that it is not
necessary to know the number of clusters.

The second artificial database used to evaluate the
classification error rates is given by the two semi-circles
with varying density of points (see Figures 1(d)-(f)).
Figure 6 presents the results obtained for an unknown
number of clusters considering the fastgreedy algorithm.
Only the best results are shown in this figure. The higher
the density of points, the smaller the error rate, since
the clusters become more defined. The error rate fails to
tend to zero only for the inverse of the Chebyshev
distance (Figure 6(c)). The most accurate clustering is
obtained by taking into account the exponential of the
Manhattan distance (Figure 6(b)). Figure 7 presents the
errors obtained when the number of clusters is known, i.e.
k = 2. Again, the complex networks-based method which
takes into account the exponential of the Manhattan
distance produces the smallest error. It is interesting to
note that the traditional clustering methods, i.e. k-means
and cobweb, result in higher error rates than the methods
based on complex networks. In addition, the error does
not tend to zero when the density of points is increased
for these traditional methods. Figure 8 presents an
example of the best clustering for the k-means and complex
networks-based methods. Observe that k-means cannot
identify the correct clusters.

FIG. 6: The smallest clustering errors obtained for the
complex networks-based methods applied to the second artificial
dataset (see Figures 1(d)-(f)). The error rates are
determined according to the density of points. The adopted
proximity measures are (a) the exponential of the Euclidean
distance, (b) the exponential of the Manhattan distance, (c) the
inverse of the Chebyshev distance and (d) the exponential of
the Chebyshev distance. The number of clusters is obtained
automatically by the maximum value of the modularity.

CONCLUSION 

In this work, we study different proximity measures
to represent a data set as a graph and then adopt
community detection algorithms to perform the respective
clustering. Our results suggest that complex
networks theory has tools to improve graph-based
clustering methodologies, since this new area provides more
accurate algorithms for community identification. In
fact, compared with traditional clustering methods, the
network-based approach finds clusters with the smallest
error rates for both real-world and artificial databases.
In addition, this methodology allows the number of
clusters to be identified automatically by taking into
account the maximum value of the modularity measurement.
Among the considered proximity measures, the
inverse of the Chebyshev distance and the inverse of the
Manhattan distance are the most suitable metrics for the
considered real-world databases. With respect to the
artificial databases, the exponential of the Chebyshev
distance and the exponential of the Manhattan distance
produce the smallest error rates. Therefore, metrics based
on the Chebyshev and Manhattan distances are the most
suitable to quantify the similarity between objects in
terms of their feature vectors. Among the community
identification algorithms, fastgreedy proved to be the
most suitable, due to its accuracy and its short
processing time.

The analysis proposed in this work can be extended
by taking into account other real-world databases as well
as other approaches to generate artificial clusters. The
application to different areas, such as medicine, biology,
physics and economics, constitutes another promising
research possibility.

FIG. 7: The smallest clustering errors obtained for the
complex networks-based methods applied to the second artificial
dataset (see Figures 1(d)-(f)). The error rates are
determined according to the density of points. The adopted
proximity measures are (a) the exponential of the Euclidean
distance, (b) the exponential of the Manhattan distance, (c) the
inverse of the Chebyshev distance and (d) the exponential of
the Chebyshev distance. The number of clusters is fixed as
k = 2. The k-means (e) and cobweb (f) are the traditional
clustering methods that produce the smallest errors.

FIG. 8: Example of the best performance for (b) k-means and for
the complex network-based method using (c) the best
modularity value and (d) fixing k = 2. The original data is
shown in (a).